INTRODUCTION:

As the sinister title suggests, we will be examining gun violence in the United States to see whether we can form hypotheses and make predictions from the data. Gun violence is extremely prevalent in the United States, and the polarizing topic of gun laws is discussed daily. It is an important topic, and this tutorial makes a basic start at exploring the data: can we predict things like the number of deaths due to gun violence in America, or which areas (cities or states) of the country will be affected the most? While this project examines a very serious topic, it is a good subject for a tutorial on the data pipeline because the data set is rich in raw data. It gives us the number of wounded victims, the number of deaths, the latitude and longitude of each incident, the date of each incident, and more. For some incidents we are even given the type of gun used.

GATHERING DATA:

The first step is downloading and importing the data we will be examining. We will be using data from:

https://www.kaggle.com/jameslko/gun-violence-data/version/1#

Download the data file and save it as a CSV file named gun_violence.csv, the name the code below expects.

library(tidyverse)
## -- Attaching packages --------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
gun_violence <- read.csv("gun_violence.csv")
head(gun_violence)
##   incident_id     date          state city_or_county
## 1      461105 1/1/2013   Pennsylvania     Mckeesport
## 2      460726 1/1/2013     California      Hawthorne
## 3      478855 1/1/2013           Ohio         Lorain
## 4      478925 1/5/2013       Colorado         Aurora
## 5      478959 1/7/2013 North Carolina     Greensboro
## 6      478948 1/7/2013       Oklahoma          Tulsa
##                                     address n_killed n_injured
## 1 1506 Versailles Avenue and Coursin Street        0         4
## 2              13500 block of Cerise Avenue        1         3
## 3                     1776 East 28th Street        1         3
## 4          16000 block of East Ithaca Place        4         0
## 5                 307 Mourning Dove Terrace        2         2
## 6                6000 block of South Owasso        4         0
##                                        incident_url
## 1 http://www.gunviolencearchive.org/incident/461105
## 2 http://www.gunviolencearchive.org/incident/460726
## 3 http://www.gunviolencearchive.org/incident/478855
## 4 http://www.gunviolencearchive.org/incident/478925
## 5 http://www.gunviolencearchive.org/incident/478959
## 6 http://www.gunviolencearchive.org/incident/478948
##                                                                                                                      source_url
## 1 http://www.post-gazette.com/local/south/2013/01/17/Man-arrested-in-New-Year-s-Eve-shooting-in-McKeesport/stories/201301170275
## 2                                                               http://www.dailybulletin.com/article/zz/20130105/NEWS/130109127
## 3                                  http://chronicle.northcoastnow.com/2013/02/14/2-men-indicted-in-new-years-day-lorain-murder/
## 4                              http://www.dailydemocrat.com/20130106/aurora-shootout-killer-was-frenetic-talented-neighbor-says
## 5                                        http://www.journalnow.com/news/local/article_d4c723e8-5a0f-11e2-a1fa-0019bb30f31a.html
## 6                 http://usnews.nbcnews.com/_news/2013/01/07/16397584-police-four-women-found-dead-in-tulsa-okla-apartment?lite
##   incident_url_fields_missing congressional_district
## 1                       FALSE                     14
## 2                       FALSE                     43
## 3                       FALSE                      9
## 4                       FALSE                      6
## 5                       FALSE                      6
## 6                       FALSE                      1
##               gun_stolen               gun_type
## 1                                              
## 2                                              
## 3 0::Unknown||1::Unknown 0::Unknown||1::Unknown
## 4                                              
## 5 0::Unknown||1::Unknown 0::Handgun||1::Handgun
## 6                                              
##                                                                                                                                                                                                                                                            incident_characteristics
## 1                                        Shot - Wounded/Injured||Mass Shooting (4+ victims injured or killed excluding the subject/suspect/perpetrator, one location)||Possession (gun(s) found during commission of other crimes)||Possession of gun by felon or prohibited person
## 2                                                                                         Shot - Wounded/Injured||Shot - Dead (murder, accidental, suicide)||Mass Shooting (4+ victims injured or killed excluding the subject/suspect/perpetrator, one location)||Gang involvement
## 3                                                                                                                                      Shot - Wounded/Injured||Shot - Dead (murder, accidental, suicide)||Shots Fired - No Injuries||Bar/club incident - in or around establishment
## 4 Shot - Dead (murder, accidental, suicide)||Officer Involved Incident||Officer Involved Shooting - subject/suspect/perpetrator killed||Drug involvement||Kidnapping/abductions/hostage||Under the influence of alcohol or drugs (only applies to the subject/suspect/perpetrator )
## 5                                                                                                              Shot - Wounded/Injured||Shot - Dead (murder, accidental, suicide)||Suicide^||Murder/Suicide||Attempted Murder/Suicide (one variable unsuccessful)||Domestic Violence
## 6                     Shot - Dead (murder, accidental, suicide)||Home Invasion||Home Invasion - Resident killed||Mass Shooting (4+ victims injured or killed excluding the subject/suspect/perpetrator, one location)||Armed robbery with injury/death and/or evidence of DGU found
##   latitude location_description longitude n_guns_involved
## 1  40.3467                       -79.8559              NA
## 2  33.9090                      -118.3330              NA
## 3  41.4455          Cotton Club  -82.1377               2
## 4  39.6518                      -104.8020              NA
## 5  36.1140                       -79.9569               2
## 6  36.2405     Fairmont Terrace  -95.9768              NA
##                                                                                                                             notes
## 1                                                                          Julian Sims under investigation: Four Shot and Injured
## 2                                                                      Four Shot; One Killed; Unidentified shooter in getaway car
## 3                                                                                                                                
## 4                                                                                                                                
## 5 Two firearms recovered. (Attempted) murder suicide - both succeeded in fulfilling an M/S and did not succeed, based on details.
## 6                                                                                                                                
##                     participant_age
## 1                             0::20
## 2                             0::20
## 3 0::25||1::31||2::33||3::34||4::33
## 4        0::29||1::33||2::56||3::33
## 5        0::18||1::46||2::14||3::47
## 6        0::23||1::23||2::33||3::55
##                                                                participant_age_group
## 1               0::Adult 18+||1::Adult 18+||2::Adult 18+||3::Adult 18+||4::Adult 18+
## 2                             0::Adult 18+||1::Adult 18+||2::Adult 18+||3::Adult 18+
## 3               0::Adult 18+||1::Adult 18+||2::Adult 18+||3::Adult 18+||4::Adult 18+
## 4                             0::Adult 18+||1::Adult 18+||2::Adult 18+||3::Adult 18+
## 5                            0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::Adult 18+
## 6 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::Adult 18+||4::Adult 18+||5::Adult 18+
##                                             participant_gender
## 1                         0::Male||1::Male||3::Male||4::Female
## 2                                                      0::Male
## 3                  0::Male||1::Male||2::Male||3::Male||4::Male
## 4                         0::Female||1::Male||2::Male||3::Male
## 5                       0::Female||1::Male||2::Male||3::Female
## 6 0::Female||1::Female||2::Female||3::Female||4::Male||5::Male
##                                                                                            participant_name
## 1                                                                                            0::Julian Sims
## 2                                                                                         0::Bernard Gillis
## 3                      0::Damien Bell||1::Desmen Noble||2::Herman Seagers||3::Ladd Tate Sr||4::Tallis Moore
## 4                       0::Stacie Philbrook||1::Christopher Ratliffe||2::Anthony Ticali||3::Sonny Archuleta
## 5       0::Danielle Imani Jameison||1::Maurice Eugene Edmonds, Sr.||2::Maurice Edmonds II||3::Sandra Palmer
## 6 0::Rebeika Powell||1::Kayetie Melchor||2::Misty Nunley||3::Julie Jackson||4::James Poore||5::Cedric Poore
##   participant_relationship
## 1                         
## 2                         
## 3                         
## 4                         
## 5                3::Family
## 6                         
##                                                                         participant_status
## 1                              0::Arrested||1::Injured||2::Injured||3::Injured||4::Injured
## 2                                            0::Killed||1::Injured||2::Injured||3::Injured
## 3 0::Injured, Unharmed, Arrested||1::Unharmed, Arrested||2::Killed||3::Injured||4::Injured
## 4                                               0::Killed||1::Killed||2::Killed||3::Killed
## 5                                             0::Injured||1::Injured||2::Killed||3::Killed
## 6 0::Killed||1::Killed||2::Killed||3::Killed||4::Unharmed, Arrested||5::Unharmed, Arrested
##                                                                     participant_type
## 1                     0::Victim||1::Victim||2::Victim||3::Victim||4::Subject-Suspect
## 2                     0::Victim||1::Victim||2::Victim||3::Victim||4::Subject-Suspect
## 3            0::Subject-Suspect||1::Subject-Suspect||2::Victim||3::Victim||4::Victim
## 4                                0::Victim||1::Victim||2::Victim||3::Subject-Suspect
## 5                                0::Victim||1::Victim||2::Victim||3::Subject-Suspect
## 6 0::Victim||1::Victim||2::Victim||3::Victim||4::Subject-Suspect||5::Subject-Suspect
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         sources
## 1                                                                                                                                                                                                                    http://pittsburgh.cbslocal.com/2013/01/01/4-people-shot-in-mckeesport/||http://www.wtae.com/news/local/allegheny/U-S-Marshals-task-force-arrests-New-Year-s-party-shooting-suspect/17977588||http://www.post-gazette.com/local/south/2013/01/17/Man-arrested-in-New-Year-s-Eve-shooting-in-McKeesport/stories/201301170275
## 2                                                                                                                                                                                                                 http://losangeles.cbslocal.com/2013/01/01/man-killed-3-wounded-at-nye-party-in-hawthorne/||http://latimesblogs.latimes.com/lanow/2013/01/hawthorne-new-year-party-three-killed.html||https://usgunviolence.wordpress.com/2013/01/01/killed-man-hawthorne-ca/||http://www.dailybulletin.com/article/zz/20130105/NEWS/130109127
## 3                                                                                                                                                                                                                                                                                                                                              http://www.morningjournal.com/general-news/20130222/lorain-man-pleads-innocent-to-new-years-murder||http://chronicle.northcoastnow.com/2013/02/14/2-men-indicted-in-new-years-day-lorain-murder/
## 4 http://denver.cbslocal.com/2013/01/06/officer-told-neighbor-standoff-gunman-was-on-meth-binge/||http://www.westword.com/news/sonny-archuleta-triple-murder-in-aurora-guns-purchased-legally-55-57-5900504||http://www.denverpost.com/ci_22322380/aurora-shooter-was-frenetic-talented-neighbor-says||http://www.dailymail.co.uk/news/article-2258008/Sonny-Archuleta-Gunman-left-dead-latest-Aurora-shooting-lost-brother-gun-violence.html||http://www.dailydemocrat.com/20130106/aurora-shootout-killer-was-frenetic-talented-neighbor-says
## 5                                                                                                                                                                                                                                                           http://myfox8.com/2013/01/08/update-mother-shot-14-year-old-son-two-others-before-killing-herself/||http://myfox8.com/2013/01/07/police-respond-to-report-of-triple-shooting-in-greensboro/||http://www.journalnow.com/news/local/article_d4c723e8-5a0f-11e2-a1fa-0019bb30f31a.html
## 6        http://www.kjrh.com/news/local-news/4-found-shot-inside-apartment-in-tulsa||http://www.cbsnews.com/news/tulsa-apartment-murders-update-hearing-scheduled-for-brothers-charged-in-quadruple-killing/||http://www.kjrh.com/news/local-news/hearing-continues-for-fairmont-terrace-quadruple-homicide-suspect-cedric-poore-james-poore||http://www.kjrh.com/news/local-news/hearing-for-quadruple-murder-suspects-continue||http://usnews.nbcnews.com/_news/2013/01/07/16397584-police-four-women-found-dead-in-tulsa-okla-apartment?lite
##   state_house_district state_senate_district
## 1                   NA                    NA
## 2                   62                    35
## 3                   56                    13
## 4                   40                    28
## 5                   62                    27
## 6                   72                    11

Now that we have stored the data set in gun_violence, it is time to tidy and modify the data so that it is easier to use in the next stages of analysis. The “%>%” symbol is the pipe operator; it passes the result on its left to the function on its right, which lets you chain a sequence of transformations on a data frame.
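
To make the pipe concrete, here is a minimal sketch on a made-up three-row data frame (the `toy` data and its values are hypothetical, not taken from the gun violence data set). It shows that a piped chain gives the same result as the equivalent nested call, only written in reading order.

```r
library(dplyr)

# Hypothetical toy data, not taken from the gun violence data set
toy <- data.frame(city = c("A", "B", "C"),
                  n_killed = c(0, 1, 4),
                  n_injured = c(4, 3, 0))

# Nested call: has to be read from the inside out
nested <- select(mutate(toy, n_victims = n_killed + n_injured), city, n_victims)

# Piped call: read top to bottom; each step receives the previous result
piped <- toy %>%
  mutate(n_victims = n_killed + n_injured) %>%
  select(city, n_victims)

# Both approaches build the same data frame
identical(nested, piped)
```

The same pattern is applied to the full data set next.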

library(stringr)

gun_violence_df <- gun_violence %>%
  # mutate() adds new columns computed from existing ones
  # count male and female participants and add them to get n_participants
  # (str_count() is case-sensitive, so "Male" does not match inside "Female")
  mutate(n_male = str_count(participant_gender, "Male")) %>%
  mutate(n_fem = str_count(participant_gender, "Female")) %>%
  mutate(n_participants = n_male + n_fem) %>%
  # count the number of adults and teens
  mutate(n_adults = str_count(participant_age_group, "Adult")) %>%
  mutate(n_teens = str_count(participant_age_group, "Teen")) %>%
  # too many dates to represent on a graph, so parse them and extract the year
  mutate(Date = mdy(date)) %>%
  mutate(Year = year(Date)) %>%
  mutate(n_victims = n_killed + n_injured) %>%
  # select() keeps only the columns we want to look at
  select(incident_id, Date, Year, state, city_or_county, n_killed,
         n_injured, n_victims, n_male, n_fem, n_participants,
         gun_type, latitude, longitude,
         n_adults, n_teens)

slice(gun_violence_df, 1:10)
## # A tibble: 10 x 16
##    incident_id Date        Year state    city_or_county n_killed n_injured
##          <int> <date>     <dbl> <fct>    <fct>             <int>     <int>
##  1      461105 2013-01-01  2013 Pennsyl~ Mckeesport            0         4
##  2      460726 2013-01-01  2013 Califor~ Hawthorne             1         3
##  3      478855 2013-01-01  2013 Ohio     Lorain                1         3
##  4      478925 2013-01-05  2013 Colorado Aurora                4         0
##  5      478959 2013-01-07  2013 North C~ Greensboro            2         2
##  6      478948 2013-01-07  2013 Oklahoma Tulsa                 4         0
##  7      479363 2013-01-19  2013 New Mex~ Albuquerque           5         0
##  8      479374 2013-01-21  2013 Louisia~ New Orleans           0         5
##  9      479389 2013-01-21  2013 Califor~ Brentwood             0         4
## 10      492151 2013-01-23  2013 Maryland Baltimore             1         6
## # ... with 9 more variables: n_victims <int>, n_male <int>, n_fem <int>,
## #   n_participants <int>, gun_type <fct>, latitude <dbl>, longitude <dbl>,
## #   n_adults <int>, n_teens <int>
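
The participant columns pack several values into one string, with `||` between participants and `index::` in front of each value. The sketch below uses the participant_gender string from the first row of the raw data shown earlier to illustrate why counting "Male" and "Female" separately works, and shows one possible way to split the string into one label per participant (the helper names `parts` and `labels` are just for illustration).

```r
library(stringr)

# String copied from the participant_gender column of the first row shown earlier
gender <- "0::Male||1::Male||3::Male||4::Female"

# Matching is case-sensitive, so "Male" is not found inside "Female"
n_male <- str_count(gender, "Male")
n_fem <- str_count(gender, "Female")

# Alternative: split into one entry per participant, then strip the "index::" prefix
parts <- str_split(gender, fixed("||"))[[1]]
labels <- str_replace(parts, "^[0-9]+::", "")

table(labels)
```

Counting is enough for the totals used in this tutorial; splitting would only be needed if each participant had to be examined individually.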

With this cleaner data, we can easily plot the things we want to see. For example, we can make two scatter plots, one showing the share of participants each year that are male and one showing the share that are female, to see whether there is a trend.

gun_violence_df %>%
  # group_by() groups rows by an attribute so that functions like sum() and mean() work per group
  group_by(Year) %>%
  # share of participants that are male, computed per year (a fraction between 0 and 1)
  mutate(pct_male = sum(n_male) / sum(n_participants)) %>%
  # ggplot() lets you set the x and y coordinates
  ggplot(aes(x = Year, y = pct_male)) +
  # scatter plot
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = "Percentage of Male Participants", y = "%Male", x = "Year")

gun_violence_df %>%
  # group_by() groups rows by an attribute so that functions like sum() and mean() work per group
  group_by(Year) %>%
  # share of participants that are female, computed per year (a fraction between 0 and 1)
  mutate(pct_fem = sum(n_fem) / sum(n_participants)) %>%
  # ggplot() lets you set the x and y coordinates
  ggplot(aes(x = Year, y = pct_fem)) +
  # scatter plot
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = "Percentage of Female Participants", y = "%Female", x = "Year")
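
If you would rather read the yearly shares as a table, one option is to collapse each year to a single row with summarise() instead of mutate(). This is a sketch on made-up numbers (the `toy` data frame is hypothetical; only the column names mirror gun_violence_df).

```r
library(dplyr)

# Hypothetical incident-level data; the column names mirror gun_violence_df
toy <- data.frame(Year = c(2013, 2013, 2014, 2014),
                  n_male = c(3, 1, 2, 4),
                  n_fem = c(1, 1, 0, 2))

yearly_share <- toy %>%
  mutate(n_participants = n_male + n_fem) %>%
  group_by(Year) %>%
  # summarise() returns one row per year rather than one row per incident
  summarise(pct_male = sum(n_male) / sum(n_participants),
            pct_fem = sum(n_fem) / sum(n_participants))

yearly_share
```

Note that a trend line fitted to a summarised table treats every year equally, whereas the mutate() version above carries one row per incident, so years with more incidents weigh more heavily in the fit.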

The two scatter plots are sparse, because each has only one distinct value per year. Still, we can see that 2013 had the highest share of female participants, nearly 10% higher than the second-highest year, 2014. The regression line indicates that the share of female participants is falling, which means the share of male participants is rising. However, since our data only covers a five-year period, I am not sure it is enough to really make a prediction from. A much more interesting plot shows the number of victims per incident over all dates.

gun_violence_df %>%
  ggplot(aes(x = Date, y = n_victims)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(x = "Date", y = "# of Victims in Each Incident", title = "Victims in Each Incident Over Time")

Although most incidents seem to have a similar number of victims, we can see that the only incidents in this data involving more than 25 victims occurred in 2015 or later. To make the data even more interesting, we can isolate major cities to see how they compare. We can add up all of the gun violence victims for each city in each year and plot the yearly totals, colored by city, with a trend line for each city.

gun_violence_df %>%
  #group by the year and city to add up the total victims from each year in each city
  group_by(city_or_county, Year) %>%
  mutate(total_vic = sum(n_victims)) %>%
  #filter to focus on key cities
  filter(city_or_county %in% c("Baltimore", "Chicago", "Detroit", "Philadelphia",
                               "Atlanta", "Washington", "Orlando")) %>%
  ggplot(aes(x = Year, y = total_vic,   
             color = city_or_county)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(x = "Year", y = "Total Number of Victims", title = "Major City Victim Totals Over the Years")

As we can see from the plot, the only two cities that seem to have any sort of trend are Philadelphia and Chicago. Philadelphia’s total number of victims each year seems to be trending slightly downward. Chicago shows the most dramatic change of the key cities, and its number of victims seems to be increasing. Chicago also has far more gun violence victims than any of the other major cities listed here. Something worth noting comes from looking back at the previous plot of the number of victims in each incident over time: the biggest outlier is a mass shooting with over 100 victims. Let’s find more information about this shooting.

max_mass_shooting <-
  #order incidents in descending order of number of victims & then slice 1 to see the max
  gun_violence_df[order(-gun_violence_df$n_victims),]

slice(max_mass_shooting, 1)
## # A tibble: 1 x 16
##   incident_id Date        Year state   city_or_county n_killed n_injured
##         <int> <date>     <dbl> <fct>   <fct>             <int>     <int>
## 1      577157 2016-06-12  2016 Florida Orlando              50        53
## # ... with 9 more variables: n_victims <int>, n_male <int>, n_fem <int>,
## #   n_participants <int>, gun_type <fct>, latitude <dbl>, longitude <dbl>,
## #   n_adults <int>, n_teens <int>

As we can see here, the largest shooting was the mass shooting in Orlando. The total number of victims, killed plus injured, was 103. That is by far the biggest shooting in the data set. However, even with a shooting that is double the next largest mass shooting, Orlando’s victim numbers don’t come close to rivaling the gun violence in Chicago. In fact, looking at the plot of the key cities, we can see that the mass shooting in Orlando accounted for over 20% of Orlando’s gun violence victims in 2016. Those 103 victims would make up only about 2.9% of Chicago’s gun violence victims in 2016 (103 out of the 3,553 computed below). My point here is that while mass shootings are shockingly tragic and terrible, I think a lot of people are overlooking the gun violence happening in this country on a daily basis. The media covers mass shootings and shocks the nation, while I cannot think of one story about a Chicago shooting covered in the news (at least not the local news). Just to drive the point home, say we take all of the shootings with 10 or more victims, add them up across all years, and then compare them to Chicago in 2016.

Aside: Here is a small article on gun violence in Chicago: https://www.quora.com/Why-do-people-use-the-term-Chiraq-for-Chicago

Now let’s quickly find the victim total for Chicago in 2016 and the total across all incidents with 10 or more victims, and then compare them directly using a simple bar plot.

#filter shootings with 10 or more victims
mass_shootings <- filter(gun_violence_df, n_victims >= 10)

#sum the total
sum_of_mass <- sum(mass_shootings$n_victims)
sum_of_mass
## [1] 799
#filter year 2016 & city Chicago & find the sum
chicago16_incidents <- filter(gun_violence_df, Year == 2016 & city_or_county == "Chicago")
chiraq16 <- sum(chicago16_incidents$n_victims)
chiraq16
## [1] 3553
#create a quick matrix: matrix(c(val1, val2), ncol = number of columns)
bchart<-matrix(c(sum_of_mass, chiraq16), ncol = 2)
#add col & row names to matrix
colnames(bchart) <- c("Total_Mass_Shootings","Chicago16")
rownames(bchart) <- c("Victim Totals")

#simple way of making bar plot

barplot(bchart, main = "Chicago 2016 vs Mass Shootings from 2013-2018", ylab = "Victim Totals", border = "red", density = 50)
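
For readers who prefer to stay in ggplot2, here is a rough equivalent of the bar plot, together with the arithmetic behind the comparison. The totals 799 and 3553 are copied from the output above and 103 is the Orlando victim count; the data frame `totals` and the other variable names are introduced here only for illustration.

```r
library(ggplot2)

# Totals copied from the output above; the data frame itself is just for illustration
totals <- data.frame(group = c("Incidents with 10+ victims, 2013-2018", "Chicago, 2016"),
                     victims = c(799, 3553))

# Orlando's 103 victims as a percentage of Chicago's 2016 total
orlando_share <- 100 * 103 / totals$victims[2]
round(orlando_share, 1)

# How many times larger Chicago's 2016 total is than all large incidents combined
chicago_ratio <- totals$victims[2] / totals$victims[1]
round(chicago_ratio, 1)

# geom_col() draws bars whose heights are the values in the data
p <- ggplot(totals, aes(x = group, y = victims)) +
  geom_col() +
  labs(title = "Chicago 2016 vs Mass Shootings from 2013-2018",
       x = NULL, y = "Victim Totals")
p
```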

Again, the point here is that while mass shootings are horrible and tragic, everyday gun violence is often hidden behind these terrible events, and I think a lot of people overlook the important fact that gun violence happens all the time.

Let’s see if we are able to predict the number of gun violence victims in a certain area in a given year based on the overall trend. That is, let’s see if there is a relationship between year and victim totals in the data we have. Let’s look specifically at the cities we have already mentioned. The first step is to filter the data down to those specific cities.

city_gun_violence_df <- gun_violence_df %>%  
  group_by(city_or_county, Year) %>%
  mutate(total_victims = sum(n_victims)) %>%
  ungroup() %>%
  filter(city_or_county %in% c("Baltimore", "Chicago", "Detroit", "Philadelphia",
                               "Atlanta", "Washington", "Orlando"))
  
slice(city_gun_violence_df, 1:10) 
## # A tibble: 10 x 17
##    incident_id Date        Year state    city_or_county n_killed n_injured
##          <int> <date>     <dbl> <fct>    <fct>             <int>     <int>
##  1      492151 2013-01-23  2013 Maryland Baltimore             1         6
##  2      479554 2013-01-26  2013 Distric~ Washington            0         5
##  3      479592 2013-02-07  2013 Illinois Chicago               0         4
##  4      482771 2013-03-11  2013 Distric~ Washington            0        13
##  5      483737 2013-03-21  2013 Illinois Chicago               0         7
##  6      484268 2013-04-09  2013 Pennsyl~ Philadelphia          1         3
##  7      984353 2013-05-02  2013 Maryland Baltimore             1         0
##  8      486121 2013-05-11  2013 Pennsyl~ Philadelphia          0         4
##  9      486327 2013-05-15  2013 Michigan Detroit               1         4
## 10      486334 2013-05-16  2013 Pennsyl~ Philadelphia          0         4
## # ... with 10 more variables: n_victims <int>, n_male <int>, n_fem <int>,
## #   n_participants <int>, gun_type <fct>, latitude <dbl>, longitude <dbl>,
## #   n_adults <int>, n_teens <int>, total_victims <int>

Next, let’s look at a violin plot to help us visualize the range of values that victim counts are likely to fall in.

city_gun_violence_df  %>%
 group_by(city_or_county, Year) %>%
 mutate(avg_victim_count=mean(total_victims)) %>%
 ungroup() %>%
 ggplot(aes(x=factor(Year), y=total_victims)) +
 geom_violin(trim=FALSE, fill="gray")+
 labs(title="Total Victims Over Time",
 x="Year", y = "# of Victims")+
 geom_boxplot(width=0.1)+
 geom_point(aes(x=factor(Year), y = avg_victim_count)) +
 theme_classic()

Unfortunately the data does not fit well in a violin plot. Next we will look at some statistics using the tidy() function in the broom package.

library(broom)
tidy_data <- lm(Year~total_victims, city_gun_violence_df) %>%
 tidy()

tidy_data
##            term     estimate    std.error    statistic      p.value
## 1   (Intercept) 2.015694e+03 1.246541e-02 1.617030e+05 0.000000e+00
## 2 total_victims 2.696300e-05 6.129495e-06 4.398895e+00 1.092492e-05
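
One detail worth noting: the model above is written with Year as the response and total_victims as the predictor. With a single predictor this does not change the slope’s p-value, because the t-test for the slope depends only on the correlation between the two variables and the sample size. A minimal sketch with made-up data (the variable names and numbers are hypothetical, not from the data set):

```r
# Made-up data purely for illustration
set.seed(1)
year <- rep(2013:2018, each = 5)
victims <- 100 + 8 * (year - 2013) + rnorm(length(year), sd = 10)

fit_a <- lm(victims ~ year)   # victims as the response
fit_b <- lm(year ~ victims)   # year as the response, as in the model above

# Pull the slope's p-value out of each coefficient table
p_a <- summary(fit_a)$coefficients["year", "Pr(>|t|)"]
p_b <- summary(fit_b)$coefficients["victims", "Pr(>|t|)"]

# The two p-values agree up to floating-point error
all.equal(p_a, p_b)
```

The slope estimates themselves do differ between the two fits, so the direction still matters when the goal is to predict victim counts from the year.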

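Since the p-value for total_victims is very small (about 1.1e-05), there does seem to be an association between year and total victim count. With a p-value this low, we can reject the hypothesis that the two are unrelated. Bear in mind, though, that each incident row repeats its city’s yearly total, so the rows are not independent and the p-value probably overstates the evidence. Let’s look at the residuals by city. A residual is the difference between an observed value and the value the model predicts for it.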

cgvdf <- city_gun_violence_df

lm(Year ~ city_or_county, data = cgvdf) %>%
  # augment() adds the fitted values and residuals to the model data
  augment() %>%
  ggplot(aes(x = city_or_county, y = .resid)) +
  geom_boxplot() +
  labs(title = "Residual vs. City",
       x = "City",
       y = "Residual")
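
To make the definition of a residual concrete, the sketch below fits a line to five made-up points (hypothetical numbers, not from the data set) and checks that subtracting the fitted values from the observed values gives the same numbers that resid() reports.

```r
# Made-up data for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

# Residual = observed value minus fitted (predicted) value
manual_resid <- y - fitted(fit)

# Same values as the residuals stored in the model object
all.equal(unname(manual_resid), unname(resid(fit)))
```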

The residuals in the box plot vary widely within each city, which suggests that factors other than time contribute to the error; that is, other factors play a part in the total number of victims. Let’s learn more with summary().

summary(lm(cgvdf$total_victims ~ cgvdf$city_or_county * cgvdf$Year))
## 
## Call:
## lm(formula = cgvdf$total_victims ~ cgvdf$city_or_county * cgvdf$Year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2675.45   -78.94    44.49   203.87   588.22 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                 -5.134e+03  2.141e+04  -0.240
## cgvdf$city_or_countyBaltimore               -2.789e+04  2.473e+04  -1.128
## cgvdf$city_or_countyChicago                 -2.234e+05  2.278e+04  -9.808
## cgvdf$city_or_countyDetroit                  4.912e+04  2.643e+04   1.858
## cgvdf$city_or_countyOrlando                 -3.945e+04  3.213e+04  -1.228
## cgvdf$city_or_countyPhiladelphia             2.771e+05  2.540e+04  10.908
## cgvdf$city_or_countyWashington               3.041e+04  2.480e+04   1.226
## cgvdf$Year                                   2.666e+00  1.062e+01   0.251
## cgvdf$city_or_countyBaltimore:cgvdf$Year     1.411e+01  1.227e+01   1.150
## cgvdf$city_or_countyChicago:cgvdf$Year       1.122e+02  1.130e+01   9.925
## cgvdf$city_or_countyDetroit:cgvdf$Year      -2.433e+01  1.311e+01  -1.855
## cgvdf$city_or_countyOrlando:cgvdf$Year       1.956e+01  1.594e+01   1.227
## cgvdf$city_or_countyPhiladelphia:cgvdf$Year -1.372e+02  1.260e+01 -10.887
## cgvdf$city_or_countyWashington:cgvdf$Year   -1.500e+01  1.230e+01  -1.219
##                                             Pr(>|t|)    
## (Intercept)                                   0.8105    
## cgvdf$city_or_countyBaltimore                 0.2594    
## cgvdf$city_or_countyChicago                   <2e-16 ***
## cgvdf$city_or_countyDetroit                   0.0631 .  
## cgvdf$city_or_countyOrlando                   0.2195    
## cgvdf$city_or_countyPhiladelphia              <2e-16 ***
## cgvdf$city_or_countyWashington                0.2201    
## cgvdf$Year                                    0.8018    
## cgvdf$city_or_countyBaltimore:cgvdf$Year      0.2500    
## cgvdf$city_or_countyChicago:cgvdf$Year        <2e-16 ***
## cgvdf$city_or_countyDetroit:cgvdf$Year        0.0636 .  
## cgvdf$city_or_countyOrlando:cgvdf$Year        0.2198    
## cgvdf$city_or_countyPhiladelphia:cgvdf$Year   <2e-16 ***
## cgvdf$city_or_countyWashington:cgvdf$Year     0.2230    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 469.6 on 25197 degrees of freedom
## Multiple R-squared:  0.8668, Adjusted R-squared:  0.8667 
## F-statistic: 1.261e+04 on 13 and 25197 DF,  p-value: < 2.2e-16

This output is pretty hard to read, so we will tidy it up to take a better look.

summary(lm(cgvdf$total_victims ~ cgvdf$city_or_county * cgvdf$Year)) %>%
  tidy()
##                                           term      estimate   std.error
## 1                                  (Intercept) -5.133899e+03 21410.72271
## 2                cgvdf$city_or_countyBaltimore -2.788865e+04 24725.51831
## 3                  cgvdf$city_or_countyChicago -2.234132e+05 22778.98479
## 4                  cgvdf$city_or_countyDetroit  4.912289e+04 26434.21173
## 5                  cgvdf$city_or_countyOrlando -3.944732e+04 32126.43016
## 6             cgvdf$city_or_countyPhiladelphia  2.770633e+05 25400.74838
## 7               cgvdf$city_or_countyWashington  3.041336e+04 24801.99495
## 8                                   cgvdf$Year  2.666428e+00    10.62318
## 9     cgvdf$city_or_countyBaltimore:cgvdf$Year  1.411137e+01    12.26739
## 10      cgvdf$city_or_countyChicago:cgvdf$Year  1.121708e+02    11.30188
## 11      cgvdf$city_or_countyDetroit:cgvdf$Year -2.432675e+01    13.11376
## 12      cgvdf$city_or_countyOrlando:cgvdf$Year  1.955912e+01    15.93925
## 13 cgvdf$city_or_countyPhiladelphia:cgvdf$Year -1.372083e+02    12.60303
## 14   cgvdf$city_or_countyWashington:cgvdf$Year -1.499710e+01    12.30533
##      statistic      p.value
## 1   -0.2397817 8.105015e-01
## 2   -1.1279300 2.593602e-01
## 3   -9.8078642 1.143542e-22
## 4    1.8583074 6.313702e-02
## 5   -1.2278775 2.195044e-01
## 6   10.9076829 1.221263e-27
## 7    1.2262466 2.201173e-01
## 8    0.2510009 8.018155e-01
## 9    1.1503161 2.500246e-01
## 10   9.9249703 3.575673e-23
## 11  -1.8550549 6.360005e-02
## 12   1.2271046 2.197948e-01
## 13 -10.8869273 1.532453e-27
## 14  -1.2187480 2.229513e-01
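
Once the coefficients are in a data frame, picking out the significant terms is a simple filter. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in (not the gun violence data), it builds the table with base R's coef(summary(...)) instead of tidy(), and the column names are my own choice.

```r
# Stand-in data: the built-in mtcars data set, NOT the gun violence data.
fit <- lm(mpg ~ factor(cyl) * wt, data = mtcars)

# coef(summary(fit)) is a matrix with one row per model term.
coef_table <- as.data.frame(coef(summary(fit)))
names(coef_table) <- c("estimate", "std_error", "statistic", "p_value")
coef_table$term <- rownames(coef_table)
rownames(coef_table) <- NULL

# Keep only the terms significant at the 5% level, smallest p-value first.
significant <- coef_table[coef_table$p_value < 0.05, ]
significant <- significant[order(significant$p_value), ]
print(significant[, c("term", "estimate", "p_value")])
```

Applied to the table above, the same filter would leave only the Chicago and Philadelphia terms and their interactions with Year.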

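The Year main effect is not significant (p = 0.80), although its interactions with Chicago and Philadelphia are highly significant. Let us compare year as a sole factor against year combined with city. Note that the first call below puts Year on the left-hand side of the formula, so Year is modeled as the response; with a single predictor the F value and p-value are the same in either direction, but the sums of squares are on the scale of Year rather than total_victims.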

anova(lm(cgvdf$Year~cgvdf$total_victims, cgvdf))
## Analysis of Variance Table
## 
## Response: cgvdf$Year
##                        Df Sum Sq Mean Sq F value    Pr(>F)    
## cgvdf$total_victims     1     30 30.3292   19.35 1.092e-05 ***
## Residuals           25209  39512  1.5674                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(cgvdf$total_victims~cgvdf$city_or_county*cgvdf$Year , cgvdf))
## Analysis of Variance Table
## 
## Response: cgvdf$total_victims
##                                    Df     Sum Sq    Mean Sq  F value
## cgvdf$city_or_county                6 3.5873e+10 5978895881 27108.33
## cgvdf$Year                          1 2.8336e+07   28336227   128.48
## cgvdf$city_or_county:cgvdf$Year     6 2.5900e+08   43167238   195.72
## Residuals                       25197 5.5573e+09     220556         
##                                    Pr(>F)    
## cgvdf$city_or_county            < 2.2e-16 ***
## cgvdf$Year                      < 2.2e-16 ***
## cgvdf$city_or_county:cgvdf$Year < 2.2e-16 ***
## Residuals                                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

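All three terms are statistically significant, but the sums of squares show where the explanatory power comes from: city alone accounts for roughly 86% of the total sum of squares, while year and the city-by-year interaction together add less than 1%. The residual sum of squares (5.5573e+09) and the residual standard error of 469.6 are still large, so I would not rely on this model for precise predictions. From these results I conclude that year, whether on its own or interacting with location, adds little to predicting the total number of gun violence victims.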
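
A more direct way to compare a smaller model with a larger one is to fit both and pass them to anova() together, which tests whether the extra terms reduce the residual sum of squares. The sketch below again uses the built-in mtcars data as a stand-in, with wt playing the role of Year and cyl the role of city; those roles are assumptions made purely for illustration. It also computes each term's share of the total sum of squares, which is the calculation behind judging how much a term contributes.

```r
# Stand-in data: built-in mtcars (wt stands in for Year, cyl for city).
cars <- transform(mtcars, cyl = factor(cyl))

small <- lm(mpg ~ wt, data = cars)        # one numeric predictor only
full  <- lm(mpg ~ cyl * wt, data = cars)  # adds the factor and the interaction

# F-test: do the extra terms in `full` significantly reduce the residual SS?
comparison <- anova(small, full)
print(comparison)

# Share of the total sum of squares per term (sequential, Type I).
tab <- anova(full)
ss_share <- tab[["Sum Sq"]] / sum(tab[["Sum Sq"]])
names(ss_share) <- rownames(tab)
print(round(ss_share, 3))
```

Because these are sequential sums of squares, the shares depend on the order of the terms in the formula.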

One last useful way to explore the data is the leaflet package, which plots each incident on an interactive map using the longitude and latitude recorded for it. Plotting the entire data set made my RStudio crash, so I have included only the key cities we have been working with all along. I also calculate the average latitude and longitude of each major city so that I can add a marker there indicating which city it is.

library(leaflet)
## Warning: package 'leaflet' was built under R version 3.4.4
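# %<>% is the assignment pipe from the magrittr package.
# Note: without na.rm = TRUE, mean() is NA for any city with a missing coordinate.
city_gun_violence_df %<>% group_by(city_or_county) %>%
  mutate(avg_lat = mean(latitude)) %>%
  mutate(avg_lng = mean(longitude)) %>%
  ungroup()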


gun_violence_map <- leaflet(city_gun_violence_df) %>%
  addTiles() %>%
  setView(lat=39.29, lng=-76.61, zoom=11) %>%
  addCircleMarkers(popup = 
                     paste(
      "Killed:", city_gun_violence_df$n_killed, "<br>",
      "Injured:",city_gun_violence_df$n_injured,"<br>",
      "Females:", city_gun_violence_df$n_fem, "<br>",
      "Males:", city_gun_violence_df$n_male, "<br>",
      "Date:", city_gun_violence_df$Date),
             clusterOptions = markerClusterOptions()) %>%
  addMarkers(lat = city_gun_violence_df$avg_lat, lng = city_gun_violence_df$avg_lng, popup = city_gun_violence_df$city_or_county)
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 892 rows with
## either missing or invalid lat/lon values and will be ignored
## Warning in validateCoords(lng, lat, funcName): Data contains 25211 rows
## with either missing or invalid lat/lon values and will be ignored
gun_violence_map
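
The second warning above reports all 25211 rows as invalid for the city markers. The most likely cause is that mean() returns NA as soon as one coordinate in a group is missing, so every avg_lat and avg_lng ends up NA. The sketch below shows the behavior and one possible fix on a toy data frame; the rows and coordinates are made up for illustration and are not taken from the data set.

```r
# Toy stand-in for city_gun_violence_df; the values are illustrative only.
incidents <- data.frame(
  city_or_county = c("Baltimore", "Baltimore", "Chicago", "Chicago"),
  latitude  = c(39.29, NA, 41.88, 41.90),
  longitude = c(-76.61, NA, -87.63, -87.65)
)

# Without na.rm, a single missing value makes the whole group mean NA.
naive <- mean(incidents$latitude[incidents$city_or_county == "Baltimore"])
print(naive)

# With na.rm = TRUE the missing rows are skipped. aggregate() also returns
# one row per city instead of repeating the average on every incident row.
city_centers <- aggregate(
  incidents[, c("latitude", "longitude")],
  by = list(city_or_county = incidents$city_or_county),
  FUN = mean, na.rm = TRUE
)
names(city_centers) <- c("city_or_county", "avg_lat", "avg_lng")
print(city_centers)
```

A table like city_centers could then be passed to addMarkers(data = city_centers, lng = ~avg_lng, lat = ~avg_lat, popup = ~city_or_county), so that one marker is drawn per city rather than one per incident.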

In this project we covered some basic techniques to gather, tidy, organize, and analyze data. The more difficult steps were evaluating attributes as predictors and studying their residuals and other statistics. Unfortunately, this data set does not seem to offer a clear illustration of those methods. However, I do believe I was able to show how much everyday gun violence is overshadowed by mass shootings. We also learned that year and the year * location interaction cannot really be used as direct predictors of the number of gun violence victims.

For more information on gun violence, the following links are a good starting point.

https://www.thetrace.org/newslettersignup?gclid=EAIaIQobChMIjZW-4tGO2wIVClqGCh11ygn_EAAYASAAEgL6U_D_BwE

This link includes graphs that make some gun violence statistics easy to visualize:

http://injuryfacts.nsc.org/home-and-community/safety-topics/firearms/data-details/?gclid=EAIaIQobChMIi92YpI6P2wIVR57ACh14EwRuEAAYAiAAEgKEVvD_BwE